Read me

A Flash-based dashboard is generated in this HTML file. It can only be displayed when the file is hosted on a web server or placed in a directory that has been added to the trusted sources in the [security settings of Macromedia]. Here are two ways to make sure you can see the dashboard correctly.

  1. If you are using the Safari browser:

Please go to ‘Safari Preferences’ -> ‘Security’ -> check ‘Enable JavaScript’ and ‘Allow Plug-ins’ -> ‘Plug-in Settings’ -> check ‘Adobe Flash Player’ -> select ‘on’ in ‘when visiting other websites’ -> ‘done’

  2. If you are using the Chrome browser:

Please go to ‘Chrome Settings’ -> ‘Advanced’ -> ‘Privacy and Security’ -> ‘Content Settings’ -> ‘Flash’ -> ‘Allow’ -> ‘Add’ -> enter the directory that contains this HTML file

Sorry for the inconvenience.


Introduction

This is the final project of 12-709 Data Analytics for Engineered Systems. We are expected to implement an end-to-end data analysis including data collection, data organization, data cleansing, data transformation, data analysis and data visualization.

In this data analysis project, I chose the Gapminder World data to explore. It shows the health and wealth of countries around the world from the mid-20th century into the 21st century. A great deal of information about the world's health and wealth is hidden in this dataset, which covers more than a hundred countries.

How has the world evolved and changed over the decades, and what does it look like now? These questions relate not only to the social sciences but also to economics, and they have an impact on engineering developments as well. So I would like to explore this topic, generate stories from the data, and explain the principles hidden behind the data.

Task Description

There are some key questions we can raise about this topic. The following tasks are what we are going to explore in this project:

  • time series data of world health and wealth
  • health/wealth condition differences between countries in the world
  • how countries evolve over time
  • hidden relationship between the health and wealth
  • creating dashboard

Data Description

In this section, I identify the datasets that are used in this final project.

The file project_data.rda contains the following data frames, which all pertain to global health statistics:

  • pop.by.age: contains the population for 138 countries, for the years 1950-2050 (using projected populations after 2007), broken up into three age groups (0-19 years, 20-60 years, and 61+ years)
  • gdp.lifeExp: the per capita GDP (a measure of economic wealth) and life expectancy for these countries, for the years 1952-2007
  • gdp.lifeExp.small: a small subset of the years in gdp.lifeExp
  • continents: the continent of each country in the previous datasets

This data was made famous by Hans Rosling (1948-2017) and his Gapminder Foundation. You can see one of his videos here: https://www.youtube.com/watch?v=BPt8ElTQMIg


1. World Population Evolution

There are several libraries we need:

library(ggplot2)
library(mclust)
library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(knitr)
library(splines)
library(googleVis)
library(RJSONIO)

Let us load our data first.

load('/Users/apple/Desktop/12709/project/project_data.rda')

Then we need to rename the age group labels into 0-19 Years old, 20-60 Years old and 61+ Years old.

colnames(pop.by.age)=c('country','year','0-19 Years old','20-60 Years old','61+ Years old','continent')

1.1 Explore data

First, we show how the population of every country changes over time, broken down into the three age groups. To make the countries' populations easier to compare and cluster, we scale the data to percentages of each country's total population.

# mutate the total population for each row in the dataset
sum.population=pop.by.age[,c("0-19 Years old","20-60 Years old","61+ Years old")]
sum.population=cbind(sum.population, sum.pop=rowSums(sum.population))
pop.by.age$pop.sum=sum.population$sum.pop
# melt the data to have the age.group label
pop.by.age.melt=melt(pop.by.age,id.vars = c('country','year','continent','pop.sum'))
# rename the label into 'age.group'
colnames(pop.by.age.melt)[5]='age.group'
# mutate the percentage of the population
pop.by.age.melt$percent=pop.by.age.melt$value/pop.by.age.melt$pop.sum
# show how the population of all the countries changes over time grouped by three different age groups
ggplot(data=pop.by.age.melt, mapping=aes(x=year, y=percent,group=country,color=continent)) + geom_line()+facet_wrap('age.group')+labs(x = 'Year', y = 'Percentage', title='Population by Age (in percentage), 1950-2050 \n (All Countries)',caption='Figure 1.1')+theme(plot.title = element_text(hjust = 0.5),plot.caption =element_text(hjust = 0.5))
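The scaling step above can be sketched on a small example. The data frame below is made up for illustration (the country names, column names and counts are assumptions, not values from project_data.rda), and the sketch uses base R only.

```r
# Toy data: hypothetical populations (in thousands) for two made-up countries
toy.pop <- data.frame(
  country = c("A", "B"),
  year    = c(1950, 1950),
  young   = c(500, 200),   # stands in for '0-19 Years old'
  middle  = c(400, 500),   # stands in for '20-60 Years old'
  old     = c(100, 300)    # stands in for '61+ Years old'
)
age.cols <- c("young", "middle", "old")
# total population for each row
toy.pop$pop.sum <- rowSums(toy.pop[, age.cols])
# share of each age group; the shares in each row sum to 1
toy.share <- toy.pop[, age.cols] / toy.pop$pop.sum
print(cbind(toy.pop["country"], round(toy.share, 2)))
```

Working with shares rather than raw counts puts large and small countries on the same scale, which is what makes the lines in the plot comparable.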

We can see from the figure above that different countries follow different trends, so let us now use a clustering method to divide the countries into four groups according to how their age structure evolves.

1.2 Clustering

First, reshape the dataset into the wide format needed to call the Mclust function.

# drop the pop.sum and value columns (columns 4 and 6), keeping country, year, continent, age.group and percent
pop.by.age.clust=pop.by.age.melt[,-c(4,6)]
# mutate the interaction between the year and the age.group
pop.by.age.clust = mutate(pop.by.age.clust, year.age.group = interaction(year, age.group))
# the spread command works best if there are no extraneous columns, so we only select a subset from the dataset
pop.by.age.clust = subset(pop.by.age.clust, select = c('country', 'year.age.group', 'percent'))
# spread command 
pop.by.age.clust.spread = spread(pop.by.age.clust, key = year.age.group, value= percent)
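# call the Mclust function and set G to 4 (i.e. 4 clusters)
# note: column 1 of the spread data is 'country', so the range 1:64 passes it to Mclust together with the 63 percentage columns
clust=Mclust(pop.by.age.clust.spread[,1:64],G=4)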
# create a new column to add the clustering results
pop.by.age.clust.spread$clust=clust$classification
# change the label names for classification
pop.by.age.clust.spread$clust[pop.by.age.clust.spread$clust==1]='Group 1'
pop.by.age.clust.spread$clust[pop.by.age.clust.spread$clust==2]='Group 2'
pop.by.age.clust.spread$clust[pop.by.age.clust.spread$clust==3]='Group 3'
pop.by.age.clust.spread$clust[pop.by.age.clust.spread$clust==4]='Group 4'
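The reshape-then-cluster pattern above can be sketched on toy data. Everything here is an assumption made for illustration: the six countries and their shares are invented, base R's reshape() stands in for tidyr's spread(), and kmeans() is swapped in for Mclust so the sketch runs without extra packages. It is not the model used in this report.

```r
set.seed(1)
# Toy long-format data: six made-up countries, two years, one share per year
toy.long <- data.frame(
  country = rep(paste0("C", 1:6), each = 2),
  year    = rep(c(1950, 2000), times = 6),
  percent = c(0.50, 0.48,  0.52, 0.49,  0.51, 0.50,   # stay young
              0.30, 0.20,  0.31, 0.19,  0.29, 0.21)   # start older and age further
)
# one row per country, one column per year
toy.wide <- reshape(toy.long, idvar = "country", timevar = "year",
                    direction = "wide")
# keep only the numeric feature columns; the country identifier is not a feature
features <- toy.wide[, setdiff(names(toy.wide), "country")]
# kmeans is used here only as a stand-in for Mclust
fit <- kmeans(features, centers = 2, nstart = 10)
toy.wide$clust <- paste("Group", fit$cluster)
print(toy.wide)
```

Selecting the feature columns by name rather than by position keeps the identifier column out of the clustering input even if the column order changes.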

Then we can plot the dataset with the clustering results, keeping the previous plot in the background as a reference.

# first make a copy
pop.by.age.melt.2=pop.by.age.melt
# then merge the dataset to add cluster labels to the dataset
label.df=pop.by.age.clust.spread[,c(1,65)]
pop.by.age.melt.2=merge(pop.by.age.melt.2,label.df,by='country')
# show the plot with pooled data
plot=ggplot(data=pop.by.age.melt.2,mapping=aes(x=year,y=percent,group=country,color=continent))+geom_line(data=pop.by.age.melt,mapping=aes(group=country),color='grey',alpha=0.2)+geom_line() +labs(x = 'Year', y = 'Percentage', title='Population by Age (in percentage), 1950-2050 \n (All Countries)',caption='Figure 1.2')+theme(plot.caption=element_text(hjust = 0.5),plot.title = element_text(hjust = 0.5))+facet_grid(clust ~ age.group)
# add the annotation
len=12
vars=data.frame(expand.grid(levels(factor(pop.by.age.melt.2$clust)),levels(pop.by.age.melt.2$age.group)))
colnames(vars)=c('clust','age.group')
dat=data.frame(x=rep(2010,len),y=rep(0.6,len),vars,labs=c('Starts low/Ends low','Starts high/Ends medium','Ends low','Starts high/Ends high','Starts high/Ends low','Starts low/Ends high','Starts medium/Ends medium','Starts low/Ends high','Starts high/Ends high','Starts low/Ends medium','Starts medium/Ends high','Starts low/Ends low'))
dat[1:4,2]=0.1
dat[5:8,2]=0.2
plot+geom_text(aes(x,y,label=labs,group=NULL),color='black',data=dat)

This plot shows how the age demographics are changing over time for all 138 countries in the data set, where we have used the Mclust clustering algorithm to divide the countries up into four groups (note that the clusters differ slightly from the continents):

  • group 1: a group whose age demographics are older than the other countries, for the entire time span
  • group 4: a group whose age demographics are younger than the other countries, for the entire time span
  • groups 2 and 3: these groups initially are young and look more like group 4 in 1950, but in later years their demographics shift towards group 1. This might be due to improvements in living quality for these countries. Group 3 shifts further than group 2.

The clusters show that the countries can be divided into four groups according to how their age demographics change over time (the trends are summarized by the annotations in each facet).

It is also worth noting that Group 1 contains most of the European countries, where the share of the younger population stays lower and the share of the older population stays higher than in other countries. Most countries in Group 4 are African countries, which show the reverse. This pattern is also reflected in the plots in Section 2 and Section 3 and in the analysis of improvements in living quality around the world.


2. World health (life expectancy) and wealth (GDP)

2.1 Data cleansing

First, let us plot the data for all countries. Since two categories (i.e. lifeExp and GDP) have different scales of value, we need to set the scales ‘free’ for each item.

# melt the data
gdp.lifeExp.melt=melt(gdp.lifeExp,id.vars=c(1,2,5))
# add annotation
dat=data.frame(x=1980,y=90000,variable='gdp.per.capita',labs=c('Kuwait'))
# show the plot for all countries
ggplot(data=gdp.lifeExp.melt, mapping=aes(x=year, y=value,group=country,color=continent)) + geom_line()+facet_wrap('variable',scales = 'free')+labs(x = 'Year', y = 'Value', title='Life Expectancy and GDP per Capita, 1952-2007 \n (All Countries)',caption='Figure 2.1')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))+geom_text(aes(x,y,label=labs,group=NULL),color='black',data=dat)

There is one outlier: Kuwait.

The GDP per capita of Kuwait clearly follows a different pattern from that of other countries. The overall trend in GDP per capita was upward over time, while Kuwait's was not. Kuwait's record might be wrong, which would explain why it stands out from the rest of the dataset. Since clustering aims to find countries with similar patterns of change in both life expectancy and GDP over time, an outlier this different would have a strong impact on the clustering performance. So we remove it from the dataset to build a better clustering model.

# remove the Kuwait
gdp.lifeExp=subset(gdp.lifeExp,country!='Kuwait')
gdp.lifeExp.melt=melt(gdp.lifeExp,id.vars=c(1,2,5))
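In this report the outlier was spotted by eye in Figure 2.1. As a rough sketch of how such a check could be automated, the example below flags countries whose peak GDP per capita is far above the typical peak. The toy data and the "5 times the median peak" threshold are both hypothetical choices made for illustration, not a rule used in this project.

```r
# Toy data: three made-up countries, one of them ("K") with an extreme early peak
toy.gdp <- data.frame(
  country        = rep(c("A", "B", "K"), each = 3),
  year           = rep(c(1952, 1977, 2007), times = 3),
  gdp.per.capita = c(1000, 1500, 2500,
                     3000, 5000, 9000,
                     100000, 60000, 45000)
)
# peak GDP per capita for each country
peak <- tapply(toy.gdp$gdp.per.capita, toy.gdp$country, max)
# hypothetical screening rule: peak more than 5 times the median peak
flagged <- names(peak)[peak > 5 * median(peak)]
print(flagged)
# drop the flagged countries before clustering
toy.clean <- subset(toy.gdp, !(country %in% flagged))
```

A flagged country still deserves a manual look before removal, since an extreme value can be genuine rather than a recording error.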

2.2 Clustering

Use clustering to divide the countries into groups that had similar changes to life expectancy and GDP over time.

After removing Kuwait from the dataset, the clustering results are shown below.

gdp.lifeExp.melt.2= mutate(gdp.lifeExp.melt, year.variable = interaction(year, variable))
# the spread command works best if there are no extraneous columns
gdp.lifeExp.melt.2 = subset(gdp.lifeExp.melt.2, select = c('country', 'year.variable', 'value'))
# spread command 
gdp.lifeExp.spread = spread(gdp.lifeExp.melt.2, key = year.variable, value= value)
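# call the Mclust function and set G to 3 (i.e. 3 clusters)
# note: column 1 of the spread data is 'country', so the range 1:25 passes it to Mclust together with the 24 numeric columns
clust.2=Mclust(gdp.lifeExp.spread[,1:25],G=3)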
gdp.lifeExp.spread$clust=clust.2$classification
# change the label names for classification
gdp.lifeExp.spread$clust[gdp.lifeExp.spread$clust==1]='Group 1'
gdp.lifeExp.spread$clust[gdp.lifeExp.spread$clust==2]='Group 2'
gdp.lifeExp.spread$clust[gdp.lifeExp.spread$clust==3]='Group 3'
# melt the data to have the label
label.df.2=gdp.lifeExp.spread[,c(1,26)]
gdp.lifeExp.melt.2=merge(gdp.lifeExp.melt,label.df.2,by='country')
# add annotations
len=6
vars=data.frame(expand.grid(levels(gdp.lifeExp.melt.2$variable),levels(factor(gdp.lifeExp.melt.2$clust))))
colnames(vars)=c('variable','clust')
# create data frame storing the text contents and locations
dat=data.frame(x=rep(1970,len),y=rep(25,len),vars,labs=c('Starts low/Ends low \n with fluctuation','Starts low/Ends low','Starts medium/Ends medium','Starts medium/Ends medium','Starts high/Ends high','with fluctuation'))
# change some locations
dat[2,2]=40000
dat[4,2]=40000
dat[6,2]=40000
# show the plot with pooled data and annotations
ggplot(data=gdp.lifeExp.melt.2,mapping=aes(x=year,y=value, color=continent))+geom_line(data=gdp.lifeExp.melt,mapping=aes(group=country),color='grey',alpha=0.2)+geom_line(aes(group = country)) +labs(x = 'Year', y = 'Value', title='Life Expectancy and GDP per Capita, 1952-2007 \n (All Countries)',caption='Figure 2.2')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))+facet_grid(variable~clust,scales = 'free')+geom_text(aes(x,y,label=labs,group=NULL),color='black',data=dat)

Let us look at some statistics for different clusters.

# analyze the clustering model
clust.summary=summary(clust.2,parameters = T) 
# look at the probabilities and means:
kable(data.frame(clust.summary$mean),col.names = c('Group 1','Group 2','Group 3'))
                          Group 1      Group 2      Group 3
country                  69.69647     68.58637     81.52483
1952.lifeExp             38.64181     55.50668     57.17531
1957.lifeExp             40.85091     58.14238     59.73890
1962.lifeExp             42.77892     60.36405     61.94179
1967.lifeExp             45.04716     62.18791     64.03310
1972.lifeExp             47.03081     64.09486     66.06212
1977.lifeExp             48.84440     66.07853     68.09769
1982.lifeExp             51.05258     67.85088     69.94686
1987.lifeExp             52.80174     69.40743     71.69604
1992.lifeExp             53.30340     70.68084     72.89370
1997.lifeExp             53.49926     72.03664     74.07241
2002.lifeExp             53.44693     73.32632     75.00575
2007.lifeExp             55.08611     74.42111     76.11241
1952.gdp.per.capita    1044.26687   3780.15323   5399.29125
1957.gdp.per.capita    1138.25888   4486.18583   6541.31105
1962.gdp.per.capita    1245.20581   5184.60953   7755.11988
1967.gdp.per.capita    1386.38020   6150.77321   9949.07312
1972.gdp.per.capita    1521.91824   7365.05568  12820.27612
1977.gdp.per.capita    1545.57399   8366.50534  15375.75194
1982.gdp.per.capita    1611.44942   9033.30778  15943.18500
1987.gdp.per.capita    1578.89024   9731.14666  16666.70495
1992.gdp.per.capita    1563.39021   9893.93017  17462.42512
1997.gdp.per.capita    1629.79993  11015.02676  19661.30053
2002.gdp.per.capita    1771.59017  12154.87238  21503.77506
2007.gdp.per.capita    2095.14811  14202.39739  25323.11331

By extracting the mean values for each group, we can see that Group 1 contains the countries with the lowest mean life expectancy and GDP per capita in every year, Group 3 contains the countries with the highest means in every year, and the countries in Group 2 lie in between. Also, Group 1 and Group 3 countries show some fluctuations over time. The 'country' row is not a meaningful statistic: it appears because the country column was passed to Mclust along with the numeric columns.


3. How Does Income Relate to Life Expectancy?

So far, we have created some plots, but there are still several disadvantages of the plots above:

Firstly, it is hard to match each line with each country in the plots above. Since each cluster contains many countries from different continents, it is also hard to tell the trajectory for each continent.

Also, the year-by-year data above tells us a lot about the details of how countries evolve, but it tends to make us overlook the end result from a macro view.

3.1 Compare two years of data

So now let us only look at the data at the beginning year (1952) and the end year (2007):

# remove the Kuwait from the dataset
gdp.lifeExp.small=subset(gdp.lifeExp.small,country!='Kuwait')
# plot the life expectancy against GDP per capita in 1952 and 2007, respectively
ggplot(data=gdp.lifeExp.small,mapping=aes(x=gdp.per.capita,y=lifeExp,color=continent, label=country))+geom_point(size=1)+facet_wrap('year',scales = 'free')+geom_text(size=4,hjust=0, nudge_y =0.1)+labs(x = 'GDP per capita', y = 'Life Expectancy', title='Life Expectancy against GDP per Capita \n (1952 vs. 2007)',caption='Figure 3.1')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))

This plot shows the life expectancy (y-axis) and GDP per capita (x-axis) for each country in both 1952 and 2007.

Each subplot shows that life expectancy at birth increases at a decreasing rate with respect to GDP per capita (PPP).

The main reason for this non-linear relationship is that people consume both needs and wants. People consume needs in order to survive. Once a person's needs are satisfied, they can spend the rest of their money on non-necessities. If everyone's needs are satisfied, then any further increase in GDP per capita barely affects life expectancy.

3.2 Regression

Rich people live longer?

The relationship between life expectancy and GDP per capita is strong enough to be the basis of a regression model. Simple functions that increase at a decreasing rate include multiplicative (hyperbolas) and logarithmic functions.
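To make "increases at a decreasing rate" concrete, here is a minimal sketch using a logarithmic curve. The coefficients a and b are hypothetical values picked for illustration; they are not fitted to the project data.

```r
# Hypothetical coefficients for lifeExp = a + b * log(gdp); not fitted to the project data
a <- 20
b <- 5.5
life.exp <- function(gdp) a + b * log(gdp)
# gain in life expectancy from an extra 1000 dollars at two income levels
gain.poor <- life.exp(2000)  - life.exp(1000)
gain.rich <- life.exp(31000) - life.exp(30000)
print(c(poor = gain.poor, rich = gain.rich))
```

Under these assumed coefficients, the same 1000 dollars adds roughly 3.8 years at the low income level but only about 0.2 years at the high one, which is the diminishing return described above.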

The following is R output for a regression model that was fitted to the data:

# select the data in 2007
regression.data=subset(gdp.lifeExp.small,year==2007,select=c(3,4))
# create non-linear function to fit the data
spline.model = lm(lifeExp ~ ns(gdp.per.capita, df=4), data=regression.data)
# show the summary
summary(spline.model)
## 
## Call:
## lm(formula = lifeExp ~ ns(gdp.per.capita, df = 4), data = regression.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.424  -1.663   1.582   4.172  12.035 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   47.039      1.872  25.122  < 2e-16 ***
## ns(gdp.per.capita, df = 4)1   22.860      2.565   8.911 2.92e-15 ***
## ns(gdp.per.capita, df = 4)2   25.475      3.600   7.076 7.13e-11 ***
## ns(gdp.per.capita, df = 4)3   50.571      4.946  10.225  < 2e-16 ***
## ns(gdp.per.capita, df = 4)4   24.201      4.114   5.883 2.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.125 on 136 degrees of freedom
## Multiple R-squared:  0.6622, Adjusted R-squared:  0.6522 
## F-statistic: 66.64 on 4 and 136 DF,  p-value: < 2.2e-16

The summary shows that the regression model consists of an intercept and four natural cubic spline basis terms in GDP per capita (df = 4), all of which are statistically significant. The model fits the data reasonably well (R-squared of 66.2%). It isn’t necessarily the best model, but it appears to be a fairly good one.

# show the residual plot
plot(spline.model,which=1)

We can see from the residual plot that the residuals tend to be above 0 where the fitted life expectancy is between 60 and 70. The following figure shows how the model fits the data.

ggplot(data=regression.data,mapping=aes(x=gdp.per.capita,y=lifeExp))+
  geom_point(size=1)+
  geom_smooth(method='lm', formula = y ~ ns(x, df=4))+
  labs(x = 'GDP per capita', y = 'Life Expectancy', title='Regression Model of Life Expectancy against GDP per Capita',caption='Figure 3.2')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))

The shaded (grey) area is the 95% confidence band around the fitted curve (the geom_smooth default), and we can use this model to predict the life expectancy for a given gdp.per.capita.
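As a sketch of how such a prediction could be made, the example below fits the same model form (a natural cubic spline with 4 degrees of freedom) to synthetic data and calls predict() at a few income levels. The data is simulated under an assumed logarithmic relationship, so the numbers it prints say nothing about the real dataset.

```r
library(splines)
set.seed(42)
# Synthetic data: life expectancy rising with diminishing returns in GDP per capita
toy <- data.frame(gdp.per.capita = runif(150, min = 500, max = 40000))
toy$lifeExp <- 20 + 5.5 * log(toy$gdp.per.capita) + rnorm(150, sd = 2)
# same model form as in the report: natural cubic spline with 4 degrees of freedom
toy.model <- lm(lifeExp ~ ns(gdp.per.capita, df = 4), data = toy)
# predict at new income levels, with 95% confidence intervals for the mean
new.gdp <- data.frame(gdp.per.capita = c(1000, 10000, 30000))
pred <- predict(toy.model, newdata = new.gdp, interval = "confidence")
print(cbind(new.gdp, round(pred, 1)))
```

With the report's own model, the same call would be predict(spline.model, newdata = ...), where newdata has a gdp.per.capita column. Predictions are only trustworthy within the range of GDP values seen in the data.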

3.3 Further exploration

Comparing the two subplots, the axis ranges alone show that the world as a whole had both a higher life expectancy and a higher GDP per capita in 2007 than in 1952. In 1952, some countries had a life expectancy below 40 years and no country had a GDP per capita above 15K dollars. In 2007, almost all countries had a life expectancy above 40 years, and GDP per capita had increased greatly.

The overall trends in life expectancy and GDP per capita around the world are both upward. Most African countries improved their life expectancy a lot but still had a relatively low GDP per capita; most countries in Asia became both healthier and richer; and Europe maintained both its good health conditions and its high GDP per capita.

Since many of the country names overlap on the plot, it is still hard to see each country’s pattern of change. We make more detailed plots below.

First, let us calculate the total change in life expectancy and in GDP per capita from 1952 to 2007, each expressed as a growth percentage.

# mutate the growth percentages
data.1952=subset(gdp.lifeExp.small,year==1952)
data.2007=subset(gdp.lifeExp.small,year==2007)
lifeExp.change=mutate(data.1952,lifeExp.change=(data.2007$lifeExp-data.1952$lifeExp)/data.1952$lifeExp)
gdp.change=mutate(data.1952,gdp.change=(data.2007$gdp.per.capita-data.1952$gdp.per.capita)/data.1952$gdp.per.capita)
gdp.lifeExp.change=mutate(gdp.change,lifeExp.change=(data.2007$lifeExp-data.1952$lifeExp)/data.1952$lifeExp)
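The calculation above pairs the 1952 and 2007 rows by position, which assumes both subsets list the countries in the same order. A minimal sketch of an alternative that pairs them explicitly by country is shown below, using base R's merge() on made-up data (the countries and values are illustrative only).

```r
# Toy data: two made-up countries observed in 1952 and 2007
toy.small <- data.frame(
  country        = c("A", "B", "A", "B"),
  year           = c(1952, 1952, 2007, 2007),
  lifeExp        = c(40, 60, 60, 57),
  gdp.per.capita = c(1000, 2000, 3000, 1500)
)
start <- subset(toy.small, year == 1952)
end   <- subset(toy.small, year == 2007)
# join on country so each 2007 value is paired with the same country's 1952 value
both <- merge(start, end, by = "country", suffixes = c(".1952", ".2007"))
both$lifeExp.change <- (both$lifeExp.2007 - both$lifeExp.1952) / both$lifeExp.1952
both$gdp.change <- (both$gdp.per.capita.2007 - both$gdp.per.capita.1952) /
  both$gdp.per.capita.1952
print(both[, c("country", "gdp.change", "lifeExp.change")])
```

A growth value of 2 means the quantity tripled, and a negative value means it was lower in 2007 than in 1952.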

The growth percentages for the project data are plotted below:

ggplot(data=gdp.lifeExp.change,mapping=aes(x=gdp.change,y=lifeExp.change,color=continent,label=country))+geom_point(size=1)+labs(x = 'GDP per capita growth percentage', y = 'Life expectancy growth percentage', title='Life Expectancy Growth Percentage against GDP per Capita Growth Percentage \n (from 1952 to 2007)',caption='Figure 3.3')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))

The plot shows the life expectancy growth percentage against the GDP per capita growth percentage for countries all over the world. Focusing on each continent, we can see that Europe had relatively high economic growth but limited growth in life expectancy. In contrast to Europe, which had little variance in either growth rate, African and American countries had high variance in life expectancy growth. Asian countries improved both their health conditions and their economies a lot, and some of them grew their economies very quickly.


4. Country Trajectories

After analyzing the whole world and each continent, let us now look at the trajectory of each country. First, add some reference lines to the datasets for later comparison.

# reference line of global average growth percentage
lifeExp.change=mutate(lifeExp.change,avg.lifeExp.change=mean(lifeExp.change))
gdp.change=mutate(gdp.change,avg.gdp.change=mean(gdp.change))
# continental average
lifeExp.change=ddply(lifeExp.change,'continent',mutate,avg.lifeExp.change.c=mean(lifeExp.change))
gdp.change=ddply(gdp.change,'continent',mutate,avg.gdp.change.c=mean(gdp.change))
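The two kinds of reference line can be sketched with base R alone. In the example below the data is made up, and ave() is used as a stand-in for the ddply/mutate combination: it returns each group's mean aligned with the original rows.

```r
# Toy data: four made-up countries on two made-up continents
toy.change <- data.frame(
  country    = c("A", "B", "C", "D"),
  continent  = c("X", "X", "Y", "Y"),
  gdp.change = c(1.0, 3.0, 0.5, -0.5)
)
# global average: one value repeated on every row
toy.change$avg.gdp.change <- mean(toy.change$gdp.change)
# continental average: ave() returns the group mean aligned with each row
toy.change$avg.gdp.change.c <- ave(toy.change$gdp.change, toy.change$continent)
print(toy.change)
```

Storing both averages as columns means each facet of the plot can draw its reference lines directly from the same data frame as the points.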

To get a more detailed understanding of each country’s evolution, we show the growth percentages of each country’s GDP per capita and life expectancy separately.

Here is the plot of each country’s GDP per capita growth percentage (in ascending order).

ggplot(data=gdp.change,mapping=aes(x=reorder(country,gdp.change),y=gdp.change,color=continent))+
  geom_point(size=1.2)+
  geom_line(data=gdp.change,mapping=aes(x=country,y=avg.gdp.change,group=1),color='grey')+
  geom_line(data=gdp.change,mapping=aes(x=country,y=avg.gdp.change.c,group=1),alpha=0.8)+
  facet_wrap('continent',scales = 'free',nrow=3)+
  labs(x = 'Country', y = 'Growth Percentage', title='GDP per Capita Growth Percentage \n (from 1952 to 2007)',caption='Figure 4.1')+
  theme(axis.text.x=element_text(angle = 90,hjust=1),plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))

The grey reference lines show the average growth percentage all over the world, and the lines with different colors show the averages for different continents.

By comparing the two reference lines, we can see that Asia and Europe exceeded the global average, while Africa, the Americas and Oceania were below average.

However, several countries go against the overall upward trend. Let us list these countries:

# show a table for countries with decreased GDP per capita
GDP.decrease=subset(gdp.change,gdp.change<0)
GDP.decrease=GDP.decrease[,c(1,5,6)]
kable(GDP.decrease,col.names = c('country','continent','GDP per capita growth percentage'))
    country                   continent  GDP per capita growth percentage
8   Central African Republic  Africa     -0.3409787
10  Comoros                   Africa     -0.1059329
11  Congo, Dem. Rep.          Africa     -0.6444115
14  Djibouti                  Africa     -0.2199069
26  Liberia                   Africa     -0.2798353
28  Madagascar                Africa     -0.2759795
36  Niger                     Africa     -0.1866470
42  Sierra Leone              Africa     -0.0196036
43  Somalia                   Africa     -0.1845554
65  Haiti                     Americas   -0.3470665
69  Nicaragua                 Americas   -0.1166454
We could see from the table that most of the countries with negative growth of economy were African countries.

Now let us plot the growth percentages of each country’s life expectancy (ascending order).

ggplot(data=lifeExp.change,mapping=aes(x=reorder(country,lifeExp.change),y=lifeExp.change,color=continent))+
  geom_point(size=1.2)+
  geom_line(data=lifeExp.change,mapping=aes(x=country,y=avg.lifeExp.change,group=1),color='grey')+
  geom_line(data=lifeExp.change,mapping=aes(x=country,y=avg.lifeExp.change.c,group=1),alpha=0.8)+
  facet_wrap('continent',scales = 'free',nrow=3)+
  labs(x = 'Country', y = 'Growth Percentage', title='Life Expectancy Growth Percentage \n (from 1952 to 2007)',caption='Figure 4.2')+
  theme(axis.text.x=element_text(angle = 90,hjust=1),plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))

We can see from the plot that the overall trend in life expectancy around the world is upward, too. Africa, the Americas and Asia exceeded the overall average, while Europe and Oceania were below average.

Similarly, let us list the countries with negative growth percentage:

# show a table for countries with decreased life expectancy
lifeExp.decrease=subset(lifeExp.change,lifeExp.change<0) 
lifeExp.decrease=lifeExp.decrease[,c(1,5,6)] 
kable(lifeExp.decrease,col.names = c('country','continent','life expectancy growth percentage'))
    country    continent  life expectancy growth percentage
46  Swaziland  Africa     -0.043326
52  Zimbabwe   Africa     -0.102454

The table shows that all of the countries with negative growth of life expectancy were African countries.


5. Dashboard Design

In this section, I design an interactive dashboard that shows the time series data and analysis using the googleVis library.

# set option to show the chart only
op <- options(gvis.plot.tag='chart')
# set options back to original options 
options(op)

These combined motion charts serve as an interactive dashboard for exploring the time series data. Users can choose the variables for the x-axis and y-axis, and can also choose the variables for Color (such as continent) and Size (such as population). You can select only certain countries to show, and by checking the Trails option you can see the track of a country over time. One more thing to note about the dashboard: if you select gdp.per.capita for the x-axis and life.expectancy for the y-axis, then choosing “Log” instead of “Lin” should give a plot similar to the ones above, showing the logarithmic relationship between the two variables.


6. Conclusion

6.1 Summary

This project explores some questions about the world's health and wealth. I believe that the findings help us gain a better understanding of, and deeper insight into, what the world has looked like over the last 100 years and how it has evolved. Learning from this history can help us improve in the future.

In Section 1, we find four different clusters (shown in Figure 1.2) among the countries of the world according to how their populations evolve over the entire time span. These demographic changes may result from global economic development and improvements in living quality. So we further study the improvement in life expectancy and GDP per capita for each continent in Section 2 and Section 3, and for each country in Section 4.

From the plots in Section 2 and Section 3, we can say that the overall levels of both health and wealth in the world were rising from 1952 to 2007. But some individual countries went against this global pattern (as listed in the tables in Section 4). Some countries (most of them in Africa) showed negative economic growth and some (all of them in Africa) even suffered a health crisis, but no country showed negative growth in both life expectancy and GDP per capita.

In Section 4, Figure 4.1 shows that Europe and Asia made a large contribution to global economic growth. The economic rise of Asia was especially remarkable. As for the improvement in life expectancy, Asia again played the most important role in improving global health (shown in Figure 4.2). We may assume that positive economic growth to some degree raises the living standard and results in positive growth in life expectancy. Indeed, we can read from Figure 3.1 that, compared with 1952, the world in 2007 formed a continuum: high income with high life expectancy, middle income with middle life expectancy, low income with low life expectancy. The world has become healthier and richer. But the gap between the richest and the poorest grew even wider.

6.2 Limitations

The raw dataset in this project was not fully consistent and “clean”. In this project, I sometimes simply omitted or removed NA values, but there may be better ways to deal with these problems. I would like to learn more about data cleansing and how to process inconsistent data.

What is more, I have not yet found a good way to generate an infographic directly from R. I have tried my best to add text annotations to the figures to make them more like an infographic. In the future I would like to learn more about creating infographics using R.

6.3 Tool selection

In this project, I use R to do all of the work, including reading data, data cleansing, data transformation, data analysis, data visualization, creating dashboards and generating the documentation. Some of this work might be easier to do with other tools, such as creating dashboards in Tableau or adding text to the figures in PowerPoint. Since I prefer to do all of the work in code (because it is reusable in the future), I have tried my best to take advantage of different R libraries. However, it could also be effective and time-saving to use different tools, including SQL, Excel, Tableau and PowerPoint, and to combine their outputs to complete a whole project.